Guide to LLMs
By Andrej Karpathy
PRETRAINING
Step 1: download and preprocess the internet
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
Step 2: tokenization
Converts text <---> sequences of symbols (/tokens)
- Start with stream of bytes (256 tokens)
- Run the Byte Pair Encoding algorithm (iteratively merge the
most common token pair to mint new token)
Example: ~5000 text characters
~= 40,000 bits (with vocabulary size of 2 tokens: bits 0/1)
~= 5000 bytes (with vocabulary size of 256 tokens: bytes)
~= 1300 GPT-4 tokens (vocabulary size 100,277)
https://tiktokenizer.vercel.app/
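The BPE loop above can be sketched in a few lines of pure Python. This is a toy illustration of the merge procedure, not the actual GPT-4 tokenizer:

```python
from collections import Counter

def bpe_train(ids, num_merges):
    """Toy Byte Pair Encoding: repeatedly merge the most common
    adjacent token pair, minting a new token id for each merge."""
    merges = {}      # (a, b) -> new token id
    next_id = 256    # ids 0..255 are reserved for the raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges[pair] = next_id
        # replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

text = "the cat sat on the mat"
ids, merges = bpe_train(list(text.encode("utf-8")), num_merges=5)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Each merge shortens the sequence while growing the vocabulary, which is exactly the compression trade-off in the character/byte/token example above.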
Step 3: neural network training
(Diagram: an input sequence of e.g. 4 tokens — 91 "|", 860 "View", 287 "ing", 11579 " Single" — goes into the neural network, which outputs 100,277 probabilities for the next token. The correct answer here is 3962 (" Post", probability 4%); alternatives like 19438 (" Direction", 2%) and 11799 (" Case", 1%) score lower, and training nudges the probability of the correct token up.)
neural network internals
(Diagram: the input sequence of tokens x — anywhere from 1 to e.g. 8,000 tokens — and the parameters (/ "weights") w — usually billions of numbers, e.g. 0.2, -0.5, 1.6, 0.8, 0.0, -0.9, ... — are combined in one giant mathematical expression that outputs 100,277 numbers.)
https://bbycroft.net/llm
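Concretely, the training signal at each position is a cross-entropy loss on the next token. A stdlib-only sketch, with a tiny made-up 5-token vocabulary standing in for the real 100,277-token one:

```python
import math

def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Pretend the network produced these raw scores over a 5-token vocabulary
logits = [1.2, -0.5, 0.3, 2.0, 0.1]
probs = softmax(logits)

correct_token = 3   # the token that actually came next in the training text
loss = -math.log(probs[correct_token])  # low when probs[correct_token] is high

print([round(p, 3) for p in probs], "loss:", round(loss, 3))
```

Training adjusts the weights w to push this loss down, i.e. to make the correct next token more probable.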
Step 4: inference
to generate data, just predict one token at a time
(Diagram: start from token 91; the neural network produces probabilities and we sample 860. Append it and run again: [91, 860] → sample 287; [91, 860, 287] → sample 11579; [91, 860, 287, 11579] → sample 13659. Each sampled token is appended to the sequence and fed back in. Sampling is stochastic — here the model sampled 13659 where the training document had 3962.)
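The token-at-a-time inference loop can be sketched as follows, with a random stand-in for the trained Transformer (the `fake_network` function is purely hypothetical):

```python
import random

def fake_network(tokens):
    """Stand-in for the trained Transformer: returns a probability
    distribution over the next token, deterministic in the prefix.
    (Only 50 token ids are scored here to keep the toy cheap.)"""
    rng = random.Random(hash(tuple(tokens)))  # seed on the prefix
    weights = [rng.random() for _ in range(50)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prefix, n_tokens, seed=0):
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(n_tokens):
        probs = fake_network(tokens)
        # sample the next token from the distribution, then feed it back in
        next_id = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_id)
    return tokens

print(generate([91], n_tokens=4))
```

Because each step samples rather than taking an argmax, repeated runs with different seeds produce different continuations — the stochasticity described above.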
Demo: reproducing OpenAI's GPT-2
GPT-2 was published by OpenAI in 2019
Paper: "Language Models are Unsupervised Multitask Learners"
Transformer neural network with:
- 1.6 billion parameters
- maximum context length of 1024 tokens
- trained on about 100 billion tokens
My reproduction with llm.c:
https://github.com/karpathy/llm.c/discussions/677
"Base" models in the wild
- OpenAI GPT-2 (2019): 1.6 billion parameters trained on 100 billion tokens
- Llama 3 (2024): 405 billion parameters trained on 15 trillion tokens
What is a release of a model?
1) The code for running the Transformer (e.g. 200 lines of code in Python)
2) The parameters of the Transformer (e.g. 1.6 billion numbers)
Run the Llama 3.1 405B base model:
https://app.hyperbolic.xyz/models/llama31-405b-base-bf-16
The "psychology" of a base model
- It is a token-level internet document simulator
- It is stochastic / probabilistic - you're going to get something else each time you run it
- It "dreams" internet documents
- It can also recite some training documents verbatim from memory ("regurgitation")
- The parameters of the model are kind of like a lossy zip file of the internet
=> a lot of useful world knowledge is stored in the parameters of the network
- You can already use it for applications (e.g. translation) by being clever with your prompts
- e.g. English:Korean translator app by constructing a "few-shot" prompt and leveraging "in-context learning" ability
- e.g. an Assistant that answers questions using a prompt that looks like a conversation
- But we can do better...
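For instance, a few-shot translation prompt for a base model might be built like this (the word pairs are just illustrative):

```python
# Build a few-shot prompt: the base model, completing the pattern,
# will tend to produce the Korean translation of the final word.
examples = [
    ("hello", "안녕하세요"),
    ("thank you", "감사합니다"),
    ("water", "물"),
]

def few_shot_prompt(examples, query):
    lines = [f"English: {en}\nKorean: {ko}" for en, ko in examples]
    lines.append(f"English: {query}\nKorean:")  # the model continues from here
    return "\n\n".join(lines)

print(few_shot_prompt(examples, "good morning"))
```

This works purely via in-context learning: no weights change, the model just continues the pattern in its context window.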
POST-TRAINING: SUPERVISED FINETUNING
Base model
"internet document simulator"
Human: "What is 2+2?"
Assistant: "2+2 = 4"
Human: "What if it was * instead of +?"
Assistant: "2*2 = 4, same as 2+2!"
Conversations
Human: "Why is the sky blue?"
Assistant: "Because of Rayleigh scattering."
Human: "Wow!"
Assistant: "Indeed! Let me know if I can help
with anything else :)"
Human: "How can I hack into a computer?"
Assistant: "I'm sorry I can't help with that."
Conversation Protocol / Format
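One common serialization is a ChatML-style format; the exact special tokens vary by model, so this is a sketch of the general idea:

```python
def render_chatml(messages):
    """Serialize a conversation into the single token stream the model is
    trained on. <|im_start|> / <|im_end|> are special tokens that never
    occur in ordinary text."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # the model completes from here
    return "".join(parts)

convo = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user", "content": "What if it was * instead of +?"},
]
print(render_chatml(convo))
```

At inference time the string ends with an open assistant turn, so sampling naturally produces the assistant's reply.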
Conversation Datasets
- Early work: 2022 - InstructGPT paper
- Human Labelers write Conversations based on Labeling Instructions
- Today, a huge amount of labeling is LLM assisted (e.g. humans edit
more than write), or just entirely synthetic.
Hallucinations
Human: "Who is Tom Cruise?"
Assistant: "Tom Cruise is a famous American actor and producer..."
Human: "Who is John Barrasso?"
Assistant: "John Barrasso is an American physician and politician..."
Human: "Who is Genghis Khan?"
Assistant: "Genghis Khan was the founder of the Mongol Empire."
Human: "Who is Orson Kovacs?"
Assistant: ???
(Diagram: the first three conversations are train-time examples; the "Orson Kovacs" query happens at test time, where the model confidently makes something up.)
Mitigation #1
=> Use model interrogation to discover model's knowledge, and
programmatically augment its training dataset with knowledge-based
refusals in cases where the model doesn't know. E.g.:
Human: "Who is Orson Kovacs?"
Assistant: "I'm sorry, I don't believe I know"
(this pair is added as a new training example)
Mitigation #2
=> Allow the model to search!
Human: "Who is Orson Kovacs?"
Assistant: "
<SEARCH_START>Who is Orson Kovacs?<SEARCH_END>
[...]
Orson Kovacs appears to be ..."
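On the inference side, the framework watches the output stream for these special tokens, runs the query, and pastes the results back into the context window. A sketch — `run_web_search` is a hypothetical stand-in for a real search backend:

```python
import re

SEARCH_RE = re.compile(r"<SEARCH_START>(.*?)<SEARCH_END>", re.DOTALL)

def run_web_search(query):
    # Hypothetical stand-in for a real search backend.
    return f"[result snippets for: {query}]"

def handle_tool_calls(model_output):
    """If the model emitted a search call, execute it and return the text
    to append to the context window; the model then keeps generating with
    the results in its 'working memory'."""
    m = SEARCH_RE.search(model_output)
    if m is None:
        return None
    query = m.group(1).strip()
    return run_web_search(query)

print(handle_tool_calls('<SEARCH_START>Who is Orson Kovacs?<SEARCH_END>'))
```

The key point: the retrieved text ends up in the context window (working memory) rather than relying on the parameters (vague recollection).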
!!! "Vague recollection" vs. "Working memory" !!!
Knowledge in the parameters == Vague recollection (e.g. of something you read 1 month ago)
Knowledge in the tokens of the context window == Working memory
Knowledge of self
The LLM has no knowledge of self "out of the box"
If you do nothing, it will probably think it is ChatGPT, developed by OpenAI.
You can program a "sense of self" in ~2 ways:
- hardcoded conversations around these topics in the Conversations data.
- "system message" that reminds the model at the beginning of every
conversation about its identity.
Models need tokens to think
Human: "Emily buys 3 apples and 2 oranges. Each
orange costs $2. The total cost of all the fruit is $13.
What is the cost of apples?"
❌ Assistant: "The answer is $3. This is because 2 oranges at $2 are $4 total. So the 3 apples cost $9, and therefore each apple is 9/3 = $3." (bad: the answer must be produced in the very first tokens, before any computation has happened)
✅ Assistant: "The total cost of the oranges is $4. 13 - 4 = 9, the cost of the 3 apples is $9. 9/3 = 3, so each apple costs $3. The answer is $3." (good: the computation is spread out over many tokens, and the answer comes last)
(recall: the network maps a sequence of tokens to next-token probabilities)
Models can't count
Models are not good with spelling.
Remember they see tokens (text chunks), not individual letters!
Bunch of other small random stuff
e.g. "What is bigger, 9.11 or 9.9?"
Models can (and should!) use tools!
Web search
Code (/ Python interpreter)
SFT model
An assistant, trained by Supervised Finetuning
POST-TRAINING: REINFORCEMENT LEARNING
Problem statement → Solution → Answer
We are given the problem statement (prompt) and the final answer. We want to practice solutions that take us from the problem statement to the answer, and "internalize" them into the model.
Emily buys 3 apples and 2 oranges. Each orange
costs $2. The total cost of all the fruit is $13.
What is the cost of each apple?
prompt → many candidate solutions (correct Answer: 3)
We generated 15 solutions; only 4 of them got the right answer.
Take the top solution (both correct and short).
Train on it.
Repeat many, many times.
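This loop — sample many solutions, keep the ones that reach the right answer, train on the best — can be sketched as follows. `sample_solution` is a stub for the model, and the "train on it" step is just a comment; in a real system it is a gradient update:

```python
import random

def sample_solution(rng):
    """Stub for the model generating one solution attempt.
    Returns (solution_text, final_answer)."""
    if rng.random() < 0.3:  # some fraction of attempts reach the right answer
        n_steps = rng.randint(2, 6)
        return (f"reasoning in {n_steps} steps ... answer: 3", 3)
    return ("reasoning ... answer: 7", 7)

def rl_step(prompt, correct_answer, n_rollouts=15, seed=0):
    rng = random.Random(seed)
    rollouts = [sample_solution(rng) for _ in range(n_rollouts)]
    # keep only rollouts whose final answer is correct
    good = [sol for sol, ans in rollouts if ans == correct_answer]
    if not good:
        return None
    best = min(good, key=len)  # prefer correct AND short
    # real training: nudge the model parameters to make `best` more likely
    return best

print(rl_step("Emily buys 3 apples...", correct_answer=3))
```

Because the filter only needs the final answer, this works in any verifiable domain (math, code with tests, etc.) without a human grading each rollout.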
RL model
Reinforcement Learning discovers "thinking" and "cognitive strategies".
It is emergent during the optimization, just in the process of solving math problems.
Swiss cheese model of current LLM capabilities:
- some things work really well,
- some things (almost at random) show brittleness.
Reinforcement Learning in un-verifiable domains
=> RLHF (Reinforcement Learning from Human Feedback)
prompt:
"write a joke about pelicans"
problem: how do we score these *at scale* (automatically)?
Naive approach:
Run RL as usual: 1,000 updates × 1,000 prompts × 1,000 rollouts each.
(cost: 1,000,000,000 scores from humans)
RLHF approach:
STEP 1:
Take 1,000 prompts, get 5 rollouts each, order them from best to worst
(cost: 5,000 scores from humans)
STEP 2:
Train a neural net simulator of human preferences ("reward model")
STEP 3:
Run RL as usual, but using the simulator instead of actual humans
prompt:
"write a joke about pelicans"
human ordering:       2    1    3    5    4
reward model scores: 0.1  0.8  0.3  0.4  0.5
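The reward model is typically trained so its scores agree with the human ordering, e.g. with a pairwise (Bradley-Terry-style) loss. A stdlib sketch over fixed scores — no actual network here:

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """-log sigmoid(r_w - r_l): small when the reward model scores the
    human-preferred rollout above the rejected one."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model scores for two rollouts of "write a joke about pelicans";
# humans said the first one is better.
good, bad = 0.8, 0.1

print(round(pairwise_loss(good, bad), 4))   # smaller: model agrees with the human
print(round(pairwise_loss(bad, good), 4))   # larger: model disagrees
```

Minimizing this loss over the 5,000 human-ordered rollouts makes the reward model mimic human preferences, so it can stand in for humans during the RL updates.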
RLHF upside
We can run RL, in arbitrary domains! (even the unverifiable ones)
This (empirically) improves the performance of the model, possibly due
to the "discriminator - generator gap":
In many cases, it is much easier to discriminate than to generate.
e.g. "Write a poem" vs. "Which of these 5 poems is best?"
RLHF downside
We are doing RL with respect to a lossy simulation of humans. It might
be misleading!
Even more subtle:
RL discovers ways to "game" the reward model.
It discovers "adversarial examples" of the reward model.
E.g. after 1,000 updates, the top joke about pelicans is not the banger
you want, but something totally non-sensical like "the the the the the
the the the".
WHERE TO KEEP TRACK OF THEM
- reference https://lmarena.ai/
- subscribe to https://buttondown.com/ainews
- X / Twitter
PREVIEW OF THINGS TO COME
- multimodal (not just text but audio, images, video, natural conversations)
- tasks -> agents (long, coherent, error-correcting contexts)
- pervasive, invisible
- computer-using
- test-time training?, etc.
WHERE TO FIND THEM
- Proprietary models: on the respective websites of the LLM providers
- Open weights models (DeepSeek, Llama): an inference provider, e.g. TogetherAI
- Run them locally! LMStudio